Abstract. Sequential pattern mining under constraints is a challenging data mining
task. Many efficient ad hoc methods have been developed for mining sequential
patterns, but they all suffer from a lack of genericity. Recent works have
investigated Constraint Programming (CP) methods, but they are still not effective
because of their encoding. In this paper, we propose a global constraint based on
the projected databases principle which remedies this drawback. Experiments show
that our approach clearly outperforms CP approaches and competes well with ad hoc
methods on large datasets.


1 Introduction

Mining useful patterns in sequential data is a challenging task. Sequential pattern
mining is among the most important and popular data mining tasks, with many real
applications such as the analysis of web click-streams, medical or biological data,
and textual data. For effectiveness and efficiency considerations, many authors have
promoted the use of constraints to focus on the most promising patterns according to
the interests given by the final user. In line with [15], many efficient ad hoc
methods have been developed, but they suffer from a lack of genericity to handle and
to push simultaneously sophisticated combinations of various types of constraints.
Indeed, new constraints have to be hand-coded and their combinations often require
new implementations.

Recently, several proposals have investigated relationships between sequential
pattern mining and constraint programming (CP) to revisit data mining tasks in a
declarative and generic way [5,11,9,12]. The great advantage of these approaches is
their flexibility. The user can model a problem and express queries by specifying
which constraints need to be satisfied. However, none of these proposals is effective
enough because of its CP encoding. Consequently, the design of new efficient
declarative models for mining useful patterns in sequential data is clearly an
important challenge for CP.

To address this challenge, we investigate in this paper the other side of the
cross-fertilization between data mining and constraint programming, namely how the
CP framework can benefit from the power of candidate pruning mechanisms used in
sequential pattern mining. First, we introduce the global constraint
Prefix-Projection for sequential pattern mining. Prefix-Projection uses a concise
encoding and its filtering relies on the principle of projected databases [14]. The
key idea is to divide the initial database into smaller ones projected on the
frequent subsequences obtained so far, and then to mine locally frequent patterns in
each projected database by growing a frequent prefix. This global constraint uses the
principle of prefix-projected databases to keep only locally frequent items alongside
projected databases, in order to remove infrequent ones from the domains of
variables. Second, we show how the concise encoding allows for a straightforward
implementation of the frequency constraint (Prefix-Projection constraint) and of
constraints on patterns such as size, item membership and regular expressions, as
well as their simultaneous combination. Finally, experiments show that our approach
clearly outperforms CP approaches and competes well with ad hoc methods on large
datasets for mining frequent sequential patterns or patterns under various
constraints. Notably, the experiments show that our approach scales, whereas
scalability is a major issue for existing CP approaches.

The paper is organized as follows. Section 2 recalls preliminaries. Section 3
provides a critical review of ad hoc methods and CP approaches for sequential pattern
mining. Section 4 presents the global constraint Prefix-Projection. Section 5
reports the experiments we performed. Finally, we conclude and draw some perspectives.

2 Preliminaries 

This section presents background knowledge about sequential pattern mining and
constraint satisfaction problems.

2.1 Sequential Patterns 

Let $\mathcal{I}$ be a finite set of items. The language of sequences corresponds to
$\mathcal{L}_{\mathcal{I}} = \mathcal{I}^n$ where $n \in \mathbb{N}^+$.

Definition 1 (sequence, sequence database). A sequence $s$ over
$\mathcal{L}_{\mathcal{I}}$ is an ordered list $\langle s_1 s_2 \ldots s_n \rangle$,
where $s_i$, $1 \le i \le n$, is an item. $n$ is called the length of the sequence $s$.
A sequence database $SDB$ is a set of tuples $(sid, s)$, where $sid$ is a sequence
identifier and $s$ a sequence.

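Definition 2 (subsequence, $\preceq$ relation). A sequence
$\alpha = \langle \alpha_1 \ldots \alpha_m \rangle$ is a subsequence of
$s = \langle s_1 \ldots s_n \rangle$, denoted by $(\alpha \preceq s)$, if $m \le n$ and
there exist integers $1 \le j_1 < \ldots < j_m \le n$, such that $\alpha_i = s_{j_i}$
for all $1 \le i \le m$. We also say that $\alpha$ is contained in $s$ or $s$ is a
super-sequence of $\alpha$. For example, the sequence $\langle BABC \rangle$ is a
super-sequence of $\langle AC \rangle$: $\langle AC \rangle \preceq \langle BABC \rangle$.
A tuple $(sid, s)$ contains a sequence $\alpha$ if $\alpha \preceq s$.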

The cover of a sequence p in SDB is the set of all tuples in SDB in which p is 
contained. The support of a sequence p in SDB is the number of tuples in SDB which 
contain p. 

Definition 3 (coverage, support). Let $SDB$ be a sequence database and $p$ a sequence.
$cover_{SDB}(p) = \{(sid, s) \in SDB \mid p \preceq s\}$ and
$sup_{SDB}(p) = \#cover_{SDB}(p)$.


sid    Sequence
1      $\langle ABCBC \rangle$
2      $\langle BABC \rangle$
3      $\langle AB \rangle$
4      $\langle BCD \rangle$

Table 1: $SDB_1$, a sequence database example.


Definition 4 (sequential pattern). Given a minimum support threshold $minsup$, every
sequence $p$ such that $sup_{SDB}(p) \ge minsup$ is called a sequential pattern; $p$ is
then said to be frequent in $SDB$.


Example 1. Table 1 represents a sequence database of four sequences where the set of
items is $\mathcal{I} = \{A, B, C, D\}$. Let the sequence $p = \langle AC \rangle$. We
have $cover_{SDB_1}(p) = \{(1, s_1), (2, s_2)\}$. If we consider $minsup = 2$,
$p = \langle AC \rangle$ is a sequential pattern because $sup_{SDB_1}(p) \ge 2$.
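Definitions 2 and 3 and Example 1 can be illustrated with a minimal Python sketch. It is not part of the original paper: the function names (`is_subsequence`, `cover`, `support`) and the representation of sequences as strings of single-character items are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence

# Table 1 (SDB_1): sequence identifier -> sequence of single-character items.
# Representing sequences as strings is an assumption made for brevity.
SDB1: Dict[int, str] = {1: "ABCBC", 2: "BABC", 3: "AB", 4: "BCD"}


def is_subsequence(a: Sequence[str], s: Sequence[str]) -> bool:
    """Return True if a is a subsequence of s (Definition 2)."""
    it = iter(s)
    # `item in it` consumes the iterator, so each item of a must be found
    # strictly after the position where the previous item was matched.
    return all(item in it for item in a)


def cover(sdb: Dict[int, str], p: Sequence[str]) -> List[int]:
    """Identifiers of the tuples of sdb that contain p (Definition 3)."""
    return [sid for sid, s in sdb.items() if is_subsequence(p, s)]


def support(sdb: Dict[int, str], p: Sequence[str]) -> int:
    """Number of tuples of sdb that contain p (Definition 3)."""
    return len(cover(sdb, p))


if __name__ == "__main__":
    # Example 1: <AC> is contained in sequences 1 and 2 only.
    print(cover(SDB1, "AC"), support(SDB1, "AC"))
```

With $minsup = 2$, the pattern $\langle AC \rangle$ is frequent in this sketch, matching Example 1.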

Definition 5 (sequential pattern mining (SPM)). Given a sequence database $SDB$ and a
minimum support threshold $minsup$, the problem of sequential pattern mining is to
find all patterns $p$ such that $sup_{SDB}(p) \ge minsup$.

2.2 SPM under Constraints 

In this section, we define the problem of mining sequential patterns in a sequence
database satisfying user-defined constraints. Then, we review the most usual
constraints for the sequential mining problem [?].

Problem statement. Given a constraint $C(p)$ on pattern $p$ and a sequence database
$SDB$, the problem of constraint-based pattern mining is to find the complete set of
patterns satisfying $C(p)$. In the following, we present the different types of
constraints that we make explicit in the context of sequence mining. All these
constraints will be handled by our concise encoding (see Sections 4.2 and 4.5).

- The minimum size constraint $size(p, \ell_{min})$ states that the number of items of
$p$ must be greater than or equal to $\ell_{min}$.

- The item constraint $item(p, t)$ states that an item $t$ must belong (or not) to a
pattern $p$.

- The regular expression constraint [?] $reg(p, exp)$ states that a pattern $p$ must be
accepted by the deterministic finite automaton associated with the regular expression
$exp$.

2.3 Projected Databases 

We now present the necessary definitions related to the concept of projected databases [14].

Definition 6 (prefix, projection, suffix). Let
$\beta = \langle \beta_1 \ldots \beta_n \rangle$ and
$\alpha = \langle \alpha_1 \ldots \alpha_m \rangle$ be two sequences, where $m \le n$.

- Sequence $\alpha$ is called the prefix of $\beta$ iff $\forall i \in [1..m]$,
$\alpha_i = \beta_i$.

- Sequence $\beta = \langle \beta_1 \ldots \beta_n \rangle$ is called the projection of
some sequence $s$ w.r.t. $\alpha$, iff (1) $\beta \preceq s$, (2) $\alpha$ is a prefix
of $\beta$ and (3) there exists no proper super-sequence $\beta'$ of $\beta$ such that
$\beta' \preceq s$ and $\beta'$ also has $\alpha$ as prefix.

- Sequence $\gamma = \langle \beta_{m+1} \ldots \beta_n \rangle$ is called the suffix
of $s$ w.r.t. $\alpha$. With the standard concatenation operator "concat", we have
$\beta = concat(\alpha, \gamma)$.

Definition 7 (projected database). Let $SDB$ be a sequence database. The
$\alpha$-projected database, denoted by $SDB|_\alpha$, is the collection of suffixes of
sequences in $SDB$ w.r.t. prefix $\alpha$.

In [14], the authors proposed an efficient algorithm, called PrefixSpan, for mining
sequential patterns based on the concept of projected databases. It proceeds by
dividing the initial database into smaller ones projected on the frequent
subsequences obtained so far; only their corresponding suffixes are kept. Then,
sequential patterns are mined in each projected database by exploring only locally
frequent patterns.

Example 2. Let us consider the sequence database of Table 1 with $minsup = 2$.
PrefixSpan starts by scanning $SDB_1$ to find all the frequent items; each of them is
used as a prefix to get projected databases. For $SDB_1$, we get 3 disjoint subsets
w.r.t. the prefixes $\langle A \rangle$, $\langle B \rangle$, and $\langle C \rangle$.
For instance, $SDB_1|_{\langle A \rangle}$ consists of 3 suffix sequences:
$\{(1, \langle BCBC \rangle), (2, \langle BC \rangle), (3, \langle B \rangle)\}$.
Consider the projected database $SDB_1|_{\langle A \rangle}$: its locally frequent items
are $B$ and $C$. Thus, $SDB_1|_{\langle A \rangle}$ can be recursively partitioned into
2 subsets w.r.t. the two prefixes $\langle AB \rangle$ and $\langle AC \rangle$. The
$\langle AB \rangle$- and $\langle AC \rangle$-projected databases can be constructed
and recursively mined similarly. The processing of an $\alpha$-projected database
terminates when no frequent subsequence can be generated.
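The recursion described in Example 2 can be sketched in a few lines of Python. This is a simplified illustration and not the PrefixSpan implementation of [14]: it stores full suffixes instead of pseudo-projections, assumes single-character items, and the names `project`, `frequent_items` and `prefixspan` are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Table 1 (SDB_1), with sequences written as strings of one-character items.
SDB1: Dict[int, str] = {1: "ABCBC", 2: "BABC", 3: "AB", 4: "BCD"}


def project(db: Dict[int, str], item: str) -> Dict[int, str]:
    """Project db on a one-item extension of the current prefix.

    Each sequence containing `item` is replaced by the suffix that follows
    the first occurrence of `item`; the other sequences are dropped.
    """
    projected = {}
    for sid, s in db.items():
        pos = s.find(item)
        if pos >= 0:
            projected[sid] = s[pos + 1:]
    return projected


def frequent_items(db: Dict[int, str], minsup: int) -> List[str]:
    """Items occurring in at least minsup sequences of db (locally frequent)."""
    counts: Counter = Counter()
    for s in db.values():
        counts.update(set(s))  # count each item once per sequence
    return sorted(x for x, c in counts.items() if c >= minsup)


def prefixspan(db: Dict[int, str], minsup: int, prefix: str = "") -> List[str]:
    """Enumerate frequent patterns by growing `prefix` with locally frequent items."""
    patterns = []
    for item in frequent_items(db, minsup):
        grown = prefix + item
        patterns.append(grown)
        patterns.extend(prefixspan(project(db, item), minsup, grown))
    return patterns


if __name__ == "__main__":
    print(project(SDB1, "A"))                      # suffixes w.r.t. prefix <A>
    print(frequent_items(project(SDB1, "A"), 2))   # locally frequent items
    print(prefixspan(SDB1, 2))                     # all frequent patterns
```

On $SDB_1$ with $minsup = 2$, the projection on $\langle A \rangle$ yields the three suffixes of Example 2, and the recursion stops on every branch as soon as no locally frequent item remains.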

Proposition 1 establishes the support count of a sequence $\gamma$ in $SDB|_\alpha$ [14]:

Proposition 1 (Support count). For any sequence $\gamma$ in $SDB$ with prefix $\alpha$
and suffix $\beta$ s.t. $\gamma = concat(\alpha, \beta)$,
$sup_{SDB}(\gamma) = sup_{SDB|_\alpha}(\beta)$.

This proposition ensures that only the sequences in $SDB$ grown from $\alpha$ need to be
considered for the support count of a sequence $\gamma$. Furthermore, only those
suffixes with prefix $\alpha$ should be counted.
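Proposition 1 can be checked numerically on $SDB_1$ with the short sketch below. It is an illustration under stated assumptions, not material from the paper: sequences are strings of single-character items, the projection keeps the suffix following the leftmost embedding of the prefix, and the helper names are hypothetical.

```python
from typing import List


def is_subsequence(a: str, s: str) -> bool:
    """True if a is a subsequence of s."""
    it = iter(s)
    return all(item in it for item in a)


def support(sequences: List[str], p: str) -> int:
    """Number of sequences that contain p."""
    return sum(is_subsequence(p, s) for s in sequences)


def project_on(sequences: List[str], alpha: str) -> List[str]:
    """Suffixes of the sequences w.r.t. prefix alpha (leftmost embedding)."""
    projected = []
    for s in sequences:
        pos = 0
        for item in alpha:
            # str.find returns -1 when the item is absent, so pos becomes 0.
            pos = s.find(item, pos) + 1
            if pos == 0:
                break  # alpha is not contained in s
        else:
            projected.append(s[pos:])
    return projected


SDB1 = ["ABCBC", "BABC", "AB", "BCD"]

if __name__ == "__main__":
    alpha, beta = "A", "C"
    # Both sides of Proposition 1 for gamma = concat(alpha, beta) = <AC>.
    print(support(SDB1, alpha + beta), support(project_on(SDB1, alpha), beta))
```

Both printed values coincide, which is what the proposition states for $\gamma = \langle AC \rangle$.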

2.4 CSP and Global Constraints 

A Constraint Satisfaction Problem (CSP) consists of a set $X$ of $n$ variables, a domain
$\mathcal{D}$ mapping each variable $X_i \in X$ to a finite set of values $D(X_i)$, and
a set of constraints $\mathcal{C}$. An assignment $\sigma$ is a mapping from variables
in $X$ to values in their domains: $\forall X_i \in X, \sigma(X_i) \in D(X_i)$. A
constraint $c \in \mathcal{C}$ is a subset of the cartesian product of the domains of
the variables that are in $c$. The goal is to find an assignment such that all
constraints are satisfied.

Domain consistency (DC). Constraint solvers typically use backtracking search to
explore the space of partial assignments. At each assignment, filtering algorithms
prune the search space by enforcing local consistency properties like domain
consistency. A constraint $c$ on $X$ is domain consistent if and only if, for every
$X_i \in X$ and for every $d_i \in D(X_i)$, there is an assignment $\sigma$ satisfying
$c$ such that $\sigma(X_i) = d_i$. Such an assignment is called a support.
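The definition of domain consistency can be made concrete with a brute-force sketch: a value is kept only if some assignment satisfying the constraint uses it. This is an illustration only (real solvers use dedicated filtering algorithms rather than enumeration), and the function name and the toy constraint $X + Y = 5$ are assumptions.

```python
from itertools import product
from typing import Callable, Dict, Set


def enforce_domain_consistency(
    domains: Dict[str, Set[int]],
    constraint: Callable[[Dict[str, int]], bool],
) -> Dict[str, Set[int]]:
    """Keep, for each variable, only the values that have a support.

    A support of value d for variable x is an assignment of all variables,
    taken from the current domains, that satisfies the constraint and maps
    x to d. Supports are found here by exhaustive enumeration.
    """
    names = list(domains)
    supports = []
    for values in product(*(sorted(domains[n]) for n in names)):
        assignment = dict(zip(names, values))
        if constraint(assignment):
            supports.append(assignment)
    return {n: {a[n] for a in supports} for n in names}


if __name__ == "__main__":
    doms = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
    # Toy constraint X + Y = 5: value 1 has no support for X nor for Y.
    print(enforce_domain_consistency(doms, lambda a: a["X"] + a["Y"] == 5))
```

The enumeration is exponential in the number of variables, which is why the cost of ensuring DC matters for a global constraint such as the one studied here.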


Global constraints provide shorthands to often-used combinatorial substructures. We
present two global constraints. Let $X = \langle X_1, X_2, \ldots, X_n \rangle$ be a
sequence of $n$ variables. Let $V$ be a set of values, and $l$ and $u$ be two integers
s.t. $0 \le l \le u \le n$. The constraint Among($X$, $V$, $l$, $u$) states that each
value $a \in V$ should occur at least $l$ times and at most $u$ times in $X$ [?]. Given
a deterministic finite automaton $A$, the constraint Regular($X$, $A$) ensures that the
sequence $X$ is accepted by $A$ [16].

3 Related works 

This section provides a critical review of ad hoc methods and CP approaches for SPM. 

3.1 Ad hoc Methods for SPM 

GSP [?] was the first algorithm proposed to extract sequential patterns. It uses a
generate-and-test approach. Later, two major classes of methods have been proposed:

- Depth-first search based on a vertical database format, e.g. cSpade incorporating
constraints (max-gap, max-span, length) [21], SPADE [22] or SPAM [?].

- Projected pattern growth such as PrefixSpan [14] and its extensions, e.g. CloSpan
for mining closed sequential patterns [?] or Gap-BIDE [10] tackling the gap
constraint.

In [7], the authors proposed SPIRIT, based on GSP, for SPM with regular expressions.
Later, [18] introduced Sequence Mining Automata (SMA), a new approach based on a
specialized kind of Petri Net. Two variants of SMA were proposed: SMA-1P (SMA one
pass) and SMA-FC (SMA Full Check). SMA-1P processes all sequences one by one by means
of the SMA and enters all resulting valid patterns in a hash table for support
counting, while SMA-FC allows frequency-based pruning during the scan of the
database. Finally, [?] provides a survey of other constraints such as regular
expressions, length and aggregates. However, all these proposals, though efficient,
are ad hoc methods suffering from a lack of genericity. Adding new constraints often
requires developing new implementations.

3.2 CP Methods for SPM 

Following the work of [?] for itemset mining, several methods have been proposed to
mine sequential patterns using CP.

Proposals. [5] have proposed a first SAT-based model for discovering a special class
of patterns with wildcards¹ in a single sequence under different types of constraints
(e.g. frequency, maximality, closedness). [11] have proposed a CSP model for SPM.
Each sequence is encoded by an automaton capturing all subsequences that can occur
in it. [9] have proposed a CSP model for SPM with wildcards. They show how some
constraints dealing with local patterns (e.g. frequency, size, gap, regular
expressions) and constraints defining more complex patterns such as relevant
subgroups [?] and top-k patterns can be modeled using a CSP. [12] have proposed two CP
encodings for SPM. The first one uses a global constraint to encode the subsequence
relation (denoted global-p.f), while the second one encodes this relation explicitly
using additional variables and constraints (denoted decomposed-p.f).

¹ A wildcard is a special symbol that matches any item of $\mathcal{I}$ including itself.

All these proposals use reified constraints to encode the database. A reified
constraint associates a boolean variable to a constraint reflecting whether the
constraint is satisfied (value 1) or not (value 0). For each sequence $s$ of $SDB$, a
reified constraint, stating whether (or not) the unknown pattern $p$ is a subsequence
of $s$, is imposed: $(S_s = 1) \Leftrightarrow (p \preceq s)$. A great consequence is
that the encoding of the frequency measure is straightforward:
$freq(p) = \sum_{s \in SDB} S_s$. But such an encoding has a major drawback since it
requires $m = \#SDB$ reified constraints to encode the whole database. This
constitutes a strong limitation on the size of the databases that can be managed.

Most of these proposals encode the subsequence relation $(p \preceq s)$ using variables
$Pos_{s,j}$ ($s \in SDB$ and $1 \le j \le \ell$) to determine a position where $p$
occurs in $s$. Such an encoding requires a large number of additional variables
($m \times \ell$) and makes the labeling computationally expensive. In order to address
this drawback, [12] have proposed a global constraint exists-embedding to encode the
subsequence relation, and used projected frequency within an ad hoc specific
branching strategy to keep only frequent items before branching over the variables
of the pattern. However, this encoding still relies on reified constraints and
requires imposing $m$ exists-embedding global constraints.

So, we propose in the next section the Prefix-Projection global constraint that
fully exploits the principle of projected databases to encode both the subsequence
relation and the frequency constraint. Prefix-Projection does not require any
reified constraints nor any extra variables to encode the subsequence relation. As a
consequence, usual SPM constraints (see Section 2.2) can be encoded in a
straightforward way using directly the (global) constraints of the CP solver.


4 Prefix-Projection Global Constraint 

This section presents the Prefix-Projection global constraint for the SPM problem. 

4.1 A Concise Encoding 

Let $P$ be the unknown pattern of size $\ell$ we are looking for. The symbol $\Box$
stands for an empty item and denotes the end of a sequence. The unknown pattern $P$ is
encoded with a sequence of $\ell$ variables $\langle P_1, P_2, \ldots, P_\ell \rangle$
s.t. $\forall i \in [1 \ldots \ell], D(P_i) = \mathcal{I} \cup \{\Box\}$. There are two
basic rules on the domains:

1. To avoid the empty sequence, the first item of $P$ must be non-empty, so
$\Box \notin D(P_1)$.

2. To allow patterns with fewer than $\ell$ items, we impose that
$\forall i \in [1..(\ell-1)], (P_i = \Box) \rightarrow (P_{i+1} = \Box)$.
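The two rules of the concise encoding can be illustrated with a small Python sketch. It only mimics the encoding on fully assigned patterns and is not the Gecode model used in the paper; the names `encode`, `is_valid` and `decode` are hypothetical.

```python
from typing import List, Sequence

EMPTY = "□"  # empty item marking the end of a pattern


def encode(pattern: Sequence[str], ell: int) -> List[str]:
    """Encode a non-empty pattern of at most ell items on ell variables."""
    if not 1 <= len(pattern) <= ell:
        raise ValueError("pattern must have between 1 and ell items")
    return list(pattern) + [EMPTY] * (ell - len(pattern))


def is_valid(P: Sequence[str]) -> bool:
    """Check the two rules of the encoding on a complete assignment.

    Rule 1: the first variable is not the empty item.
    Rule 2: once a variable is empty, all the following ones are empty.
    """
    if not P or P[0] == EMPTY:
        return False
    return all(P[i + 1] == EMPTY for i in range(len(P) - 1) if P[i] == EMPTY)


def decode(P: Sequence[str]) -> List[str]:
    """Recover the pattern by dropping the trailing empty items."""
    return [x for x in P if x != EMPTY]


if __name__ == "__main__":
    P = encode("AB", 4)
    print(P, is_valid(P), decode(P))
```

An assignment such as `["A", "□", "B"]` violates rule 2 and is rejected by `is_valid`, which is why each pattern has a single representation under this encoding.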

4.2 Definition and Consistency Checking 

The global constraint Prefix-Projection ensures both subsequence relation and 
minimum frequency constraint. 


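Definition 8 (Prefix-Projection global constraint). Let
$P = \langle P_1, P_2, \ldots, P_\ell \rangle$ be a pattern of size $\ell$.
$\langle d_1, \ldots, d_\ell \rangle \in D(P_1) \times \ldots \times D(P_\ell)$ is a
solution of Prefix-Projection($P$, $SDB$, $minsup$) iff
$sup_{SDB}(\langle d_1, \ldots, d_\ell \rangle) \ge minsup$.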

Proposition 2. A Prefix-Projection($P$, $SDB$, $minsup$) constraint has a solution if
and only if there exists an assignment $\sigma = \langle d_1, \ldots, d_\ell \rangle$ of
variables of $P$ s.t. $SDB|_\sigma$ has at least $minsup$ suffixes of $\sigma$:
$\#SDB|_\sigma \ge minsup$.

Proof: This is a direct consequence of Proposition 1. We have straightforwardly
$sup_{SDB}(\sigma) = sup_{SDB|_\sigma}(\langle\rangle) = \#SDB|_\sigma$. Thus, suffixes
of $SDB|_\sigma$ are supports of $\sigma$ in the constraint
Prefix-Projection($P$, $SDB$, $minsup$), provided that $\#SDB|_\sigma \ge minsup$. □

The following proposition characterizes values in the domain of an unassigned (i.e.
future) variable $P_{i+1}$ that are consistent with the current assignment of
variables $\langle P_1, \ldots, P_i \rangle$.

Proposition 3. Let $\sigma = \langle d_1, \ldots, d_i \rangle$² be a current assignment
of variables $\langle P_1, \ldots, P_i \rangle$, and $P_{i+1}$ be a future variable. A
value $d \in D(P_{i+1})$ appears in a solution for
Prefix-Projection($P$, $SDB$, $minsup$) if and only if $d$ is a frequent item in
$SDB|_\sigma$:

$\#\{(sid, \gamma) \mid (sid, \gamma) \in SDB|_\sigma \wedge \langle d \rangle \preceq \gamma\} \ge minsup$

Proof: Suppose that value $d \in D(P_{i+1})$ occurs in at least $minsup$ suffixes of
$SDB|_\sigma$. From Proposition 1, we have
$sup_{SDB}(concat(\sigma, \langle d \rangle)) = sup_{SDB|_\sigma}(\langle d \rangle)$.
Hence, the assignment $\sigma \cup \langle d \rangle$ satisfies the constraint, so
$d \in D(P_{i+1})$ participates in a solution. □
Anti-monotonicity of the frequency measure. If a pattern $p$ is not frequent, then any
pattern $p'$ satisfying $p \preceq p'$ is not frequent. From Proposition 3 and according
to the anti-monotonicity property, we can derive the following pruning rule:

Proposition 4. Let $\sigma = \langle d_1, \ldots, d_i \rangle$ be a current assignment
of variables $\langle P_1, \ldots, P_i \rangle$. All values $d \in D(P_{i+1})$ that are
locally not frequent in $SDB|_\sigma$ can be pruned from the domain of variable
$P_{i+1}$. Moreover, these values $d$ can also be pruned from the domains of variables
$P_j$ with $j \in [i+2, \ldots, \ell]$.

Proof: Let $\sigma = \langle d_1, \ldots, d_i \rangle$ be a current assignment of
variables $\langle P_1, \ldots, P_i \rangle$. Let $d \in D(P_{i+1})$ s.t.
$\sigma' = concat(\sigma, \langle d \rangle)$. Suppose that $d$ is not frequent in
$SDB|_\sigma$. According to Proposition 1,
$sup_{SDB|_\sigma}(\langle d \rangle) = sup_{SDB}(\sigma') < minsup$, thus $\sigma'$ is
not frequent. So, $d$ can be pruned from the domain of $P_{i+1}$.

Suppose that the assignment $\sigma$ has been extended to $concat(\sigma, \alpha)$,
where $\alpha$ corresponds to the assignment of variables $P_j$ (with $j > i$). If
$d \in D(P_{i+1})$ is not frequent, it is straightforward that
$sup_{SDB|_\sigma}(concat(\alpha, \langle d \rangle)) \le sup_{SDB|_\sigma}(\langle d \rangle) < minsup$.
Thus, if $d$ is not frequent in $SDB|_\sigma$, it will also not be frequent in
$SDB|_{concat(\sigma, \alpha)}$. So, $d$ can be pruned from the domains of $P_j$ with
$j \in [i+2, \ldots, \ell]$. □

Example 3. Consider the sequence database of Table 1 with $minsup = 2$. Let
$P = \langle P_1, P_2, P_3 \rangle$ with $D(P_1) = \mathcal{I}$ and
$D(P_2) = D(P_3) = \mathcal{I} \cup \{\Box\}$. Suppose that $\sigma(P_1) = A$;
Prefix-Projection($P$, $SDB$, $minsup$) will remove values $A$ and $D$ from $D(P_2)$
and $D(P_3)$, since the only locally frequent items in $SDB_1|_{\langle A \rangle}$ are
$B$ and $C$.

² We indifferently denote $\sigma$ by $\langle d_1, \ldots, d_i \rangle$ or by
$\langle \sigma(P_1), \ldots, \sigma(P_i) \rangle$.



Algorithm 1: ProjectSDB(SDB, ProjSDB, α)

Data: SDB: initial database; ProjSDB: projected sequences; α: prefix

begin
 1    SDB|α ← ∅ ;
 2    for each pair (sid, start) ∈ ProjSDB do
 3        s ← SDB[sid] ;
 4        pos_α ← 1 ; pos_s ← start ;
 5        while (pos_α ≤ #α ∧ pos_s ≤ #s) do
 6            if (α[pos_α] = s[pos_s]) then
 7                pos_α ← pos_α + 1 ;
 8            pos_s ← pos_s + 1 ;
 9        if (pos_α = #α + 1) then
10            SDB|α ← SDB|α ∪ {(sid, pos_s)} ;
11    return SDB|α ;


Proposition 4 guarantees that any value (i.e. item) $d \in D(P_{i+1})$ present but not
frequent in $SDB|_\sigma$ does not need to be considered when extending $\sigma$, thus
avoiding searching over it. Clearly, our global constraint encodes the
anti-monotonicity of the frequency measure in a simple and elegant way, while CP
methods for SPM have difficulties handling this property. In [12], this is achieved
by using very specific propagators and branching strategies, making the integration
quite complex.

4.3 Building the projected databases. 

The key issue of our approach lies in the construction of the projected databases.
When projecting a prefix, instead of storing the whole suffix as a projected
subsequence, one can represent each suffix by a pair $(sid, start)$ where $sid$ is the
sequence identifier and $start$ is the starting position of the projected suffix in
the sequence $sid$. For instance, let us consider the sequence database of Table 1.
As shown in Example 2, $SDB_1|_{\langle A \rangle}$ consists of 3 suffix sequences:
$\{(1, \langle BCBC \rangle), (2, \langle BC \rangle), (3, \langle B \rangle)\}$. By
using the pseudo-projection, $SDB_1|_{\langle A \rangle}$ can be represented by the
following three pairs: $\{(1, 2), (2, 3), (3, 2)\}$. This is the principle of
pseudo-projection, adopted in PrefixSpan and exploited during the filtering step of
our Prefix-Projection global constraint. Algorithm 1 details this principle. It takes
as input a set of projected sequences $ProjSDB$ and a prefix $\alpha$. The algorithm
processes all the pairs $(sid, start)$ of $ProjSDB$ one by one (line 2), and searches
for the lowest location of $\alpha$ in the sequence $s$ corresponding to the $sid$ of
that sequence in $SDB$ (lines 6-8).

In the worst case, ProjectSDB processes all the items of all sequences. So, the time
complexity is $O(\ell \times m)$, where $m = \#SDB$ and $\ell$ is the length of the
longest sequence in $SDB$. The worst case space complexity of pseudo-projection is
$O(m)$, since we need to store for each sequence only a pair $(sid, start)$, while for
the standard projection the space complexity is $O(m \times \ell)$. Clearly, the
pseudo-projection takes much less space than the standard projection.
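The pseudo-projection can be sketched in Python as a direct transcription of Algorithm 1. This is an illustrative rendering, not the authors' Gecode code: the function name `project_sdb`, the dictionary representation of $SDB$ and the use of strings for sequences are assumptions. Positions are kept 1-based, as in the text.

```python
from typing import Dict, List, Tuple

Pairs = List[Tuple[int, int]]  # pseudo-projected database: (sid, start), 1-based

# Table 1 (SDB_1), with sequences written as strings of one-character items.
SDB1: Dict[int, str] = {1: "ABCBC", 2: "BABC", 3: "AB", 4: "BCD"}


def project_sdb(sdb: Dict[int, str], proj_sdb: Pairs, alpha: str) -> Pairs:
    """Pseudo-project the suffixes listed in proj_sdb on the prefix alpha.

    Each suffix is given by (sid, start). The result contains one pair
    (sid, pos) per suffix that contains alpha, where pos is the position
    following the lowest location of alpha in that suffix.
    """
    result: Pairs = []
    for sid, start in proj_sdb:
        s = sdb[sid]
        pos_alpha, pos_s = 1, start
        while pos_alpha <= len(alpha) and pos_s <= len(s):
            if alpha[pos_alpha - 1] == s[pos_s - 1]:
                pos_alpha += 1  # one more item of alpha has been matched
            pos_s += 1
        if pos_alpha == len(alpha) + 1:  # alpha entirely matched
            result.append((sid, pos_s))
    return result


if __name__ == "__main__":
    initial = [(sid, 1) for sid in SDB1]  # every sequence, from its first item
    on_a = project_sdb(SDB1, initial, "A")
    print(on_a)                           # pairs representing SDB_1|<A>
    print(project_sdb(SDB1, on_a, "B"))   # incremental projection on <AB>
```

Only one pair of integers is stored per sequence, whatever the length of the suffix, which is the source of the $O(m)$ space bound stated above.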








Algorithm 2: Filter-Prefix-Projection(SDB, σ, i, P, minsup)

Data: SDB: initial database; σ: current prefix ⟨σ(P_1), ..., σ(P_i)⟩; minsup: the
      minimum support threshold; PSDB: internal data structure of
      Prefix-Projection for storing pseudo-projected databases

begin
 1    if (i ≥ 2 ∧ σ(P_i) = □) then
 2        for j ← i + 1 to ℓ do
 3            P_j ← □ ;
 4        return True ;
      else
 5        PSDB_i ← ProjectSDB(SDB, PSDB_{i−1}, ⟨σ(P_i)⟩) ;
 6        if (#PSDB_i < minsup) then
 7            return False ;
          else
 8            FI ← getFreqItems(SDB, PSDB_i, minsup) ;
 9            for j ← i + 1 to ℓ do
10                foreach a ∈ D(P_j) s.t. (a ≠ □ ∧ a ∉ FI) do
11                    D(P_j) ← D(P_j) − {a} ;
12            return True ;

Function getFreqItems(SDB, ProjSDB, minsup)

Data: SDB: the initial database; ProjSDB: pseudo-projected database; minsup: the
      minimum support threshold; ExistsItem, SupCount: internal data structures
      using a hash table for support counting over items

begin
13    SupCount[] ← {0, ..., 0} ; F ← ∅ ;
14    for each pair (sid, start) ∈ ProjSDB do
15        ExistsItem[] ← {false, ..., false} ; s ← SDB[sid] ;
16        for i ← start to #s do
17            a ← s[i] ;
18            if (¬ExistsItem[a]) then
19                SupCount[a] ← SupCount[a] + 1 ;
20                ExistsItem[a] ← true ;
21                if (SupCount[a] ≥ minsup) then
22                    F ← F ∪ {a} ;
23    return F ;


4.4 Filtering 

Ensuring DC on Prefix-Projection($P$, $SDB$, $minsup$) is equivalent to finding a
sequential pattern of length $(\ell - 1)$ and then checking whether this pattern
remains a frequent pattern when extended to any item $d_\ell$ in $D(P_\ell)$. Thus,
finding such an assignment (i.e. support) is as difficult as the original problem of
sequential pattern mining. [20] has proved that the problem of counting the number of
maximal³ frequent patterns in a database of sequences is #P-complete, thereby proving
the NP-hardness of the problem of mining maximal frequent sequences. The difficulty
is due to the exponential number of candidates that should be parsed to find the
frequent patterns. Thus, finding, for every variable $P_i \in P$ and for every
$d_i \in D(P_i)$, an assignment $\sigma$ satisfying
Prefix-Projection($P$, $SDB$, $minsup$) s.t. $\sigma(P_i) = d_i$ is of exponential
nature.

So, the filtering of the Prefix-Projection constraint maintains a consistency lower
than DC. This consistency is based on specific properties of the projected databases
(see Proposition 3) and on the anti-monotonicity of the frequency constraint (see
Proposition 4), and resembles forward-checking regarding Proposition 3.
Prefix-Projection is considered as a global constraint, since all variables share the
same internal data structures that awake and drive the filtering.

Algorithm 2 describes the pseudo-code of the filtering algorithm of the
Prefix-Projection constraint. It is an incremental filtering algorithm that should be
run when the $i$ first variables are assigned according to the lexicographic ordering
$\langle P_1, P_2, \ldots, P_\ell \rangle$ of the variables of $P$. It exploits internal
data structures that enhance the filtering algorithm. More precisely, it uses an
incremental data structure, denoted $\mathcal{PSDB}$, that stores the intermediate
pseudo-projections of $SDB$, where $\mathcal{PSDB}_i$ ($i \in [0, \ldots, \ell]$)
corresponds to the $\sigma$-projected database of the current partial assignment
$\sigma = \langle \sigma(P_1), \ldots, \sigma(P_i) \rangle$ (also called prefix) of
variables $\langle P_1, \ldots, P_i \rangle$, and
$\mathcal{PSDB}_0 = \{(sid, 1) \mid (sid, s) \in SDB\}$ is the initial pseudo-projected
database of $SDB$ (case where $\sigma = \langle\rangle$). It also uses a hash table
indexing the items $\mathcal{I}$ into integers $(1 \ldots \#\mathcal{I})$ for an
efficient support counting over items (see function getFreqItems).

Algorithm 2 takes as input the current partial assignment
$\sigma = \langle \sigma(P_1), \ldots, \sigma(P_i) \rangle$ of variables
$\langle P_1, \ldots, P_i \rangle$, the length $i$ of $\sigma$ (i.e. the position of the
last assigned variable in $P$) and the minimum support threshold $minsup$. It starts
by checking if the last assigned variable $P_i$ is instantiated to $\Box$ (line 1). In
this case, the end of the sequence is reached (since value $\Box$ can only appear at
the end) and the sequence $\langle \sigma(P_1), \ldots, \sigma(P_i) \rangle$ constitutes
a frequent pattern in $SDB$; hence the algorithm sets the remaining $(\ell - i)$
unassigned variables to $\Box$ and returns true (lines 2-4). Otherwise, the algorithm
computes incrementally $\mathcal{PSDB}_i$ from $\mathcal{PSDB}_{i-1}$ by calling
function ProjectSDB (see Algorithm 1). Then, it checks in line 6 whether the current
assignment $\sigma$ is a legal prefix for the constraint (see Proposition 2). This is
done by computing the size of $\mathcal{PSDB}_i$. If this size is less than $minsup$,
we stop growing $\sigma$ and we return false. Otherwise, the algorithm computes the
set of locally frequent items $FI$ in $\mathcal{PSDB}_i$ by calling function
getFreqItems (line 8).

Function getFreqItems processes all the entries of the pseudo-projected database one
by one, counts the number of first occurrences of items $a$ (i.e. SupCount[$a$]) in
each entry $(sid, start)$, and keeps only the frequent ones (lines 13-22). This is
done by using the ExistsItem data structure. After the whole pseudo-projected
database has been processed, the frequent items are returned (line 23), and
Algorithm 2 updates the current domains of variables $P_j$ with $j \ge (i + 1)$ by
pruning inconsistent values, thus avoiding searching over infrequent items
(lines 9-11).
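The filtering step can be sketched in Python as follows. This is a simplified illustration and not the authors' Gecode propagator: domains are plain Python sets stored in a list (index 0 holds $D(P_1)$), the stack `psdb` plays the role of the $\mathcal{PSDB}$ structure, and all function names are hypothetical. The truncation `del psdb[i:]` is an assumption of this sketch to discard projections left over after backtracking.

```python
from typing import Dict, List, Set, Tuple

EMPTY = "□"                    # end-of-pattern symbol of the concise encoding
Pairs = List[Tuple[int, int]]  # pseudo-projected database: (sid, start), 1-based


def project_sdb(sdb: Dict[int, str], proj: Pairs, alpha: str) -> Pairs:
    """Algorithm 1: pseudo-project the suffixes in proj on the prefix alpha."""
    out: Pairs = []
    for sid, start in proj:
        s, pos_a, pos_s = sdb[sid], 1, start
        while pos_a <= len(alpha) and pos_s <= len(s):
            if alpha[pos_a - 1] == s[pos_s - 1]:
                pos_a += 1
            pos_s += 1
        if pos_a == len(alpha) + 1:
            out.append((sid, pos_s))
    return out


def get_freq_items(sdb: Dict[int, str], proj: Pairs, minsup: int) -> Set[str]:
    """Items occurring in at least minsup suffixes of the projected database."""
    count: Dict[str, int] = {}
    frequent: Set[str] = set()
    for sid, start in proj:
        seen: Set[str] = set()  # counts each item once per suffix
        for a in sdb[sid][start - 1:]:
            if a not in seen:
                seen.add(a)
                count[a] = count.get(a, 0) + 1
                if count[a] >= minsup:
                    frequent.add(a)
    return frequent


def filter_prefix_projection(sdb, psdb, sigma, domains, minsup) -> bool:
    """Filter the domains after the first i = len(sigma) >= 1 variables are assigned."""
    i = len(sigma)
    if i >= 2 and sigma[-1] == EMPTY:
        for j in range(i, len(domains)):  # remaining variables take the empty item
            domains[j] = {EMPTY}
        return True
    del psdb[i:]  # drop stale projections (assumption of this sketch)
    psdb.append(project_sdb(sdb, psdb[i - 1], sigma[-1]))
    if len(psdb[i]) < minsup:
        return False  # sigma is not a frequent prefix
    fi = get_freq_items(sdb, psdb[i], minsup)
    for j in range(i, len(domains)):      # prune locally infrequent items
        domains[j] = {a for a in domains[j] if a == EMPTY or a in fi}
    return True


# Example 3: P = <P1, P2, P3>, sigma(P1) = A, minsup = 2 on SDB_1.
SDB1 = {1: "ABCBC", 2: "BABC", 3: "AB", 4: "BCD"}
psdb = [[(sid, 1) for sid in SDB1]]  # PSDB_0
domains = [set("ABCD"), set("ABCD") | {EMPTY}, set("ABCD") | {EMPTY}]
ok = filter_prefix_projection(SDB1, psdb, ["A"], domains, minsup=2)
```

After this call, the domains of $P_2$ and $P_3$ no longer contain $A$ and $D$, as in Example 3, while $D(P_1)$ is left untouched.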


³ A sequential pattern $p$ is maximal if there is no sequential pattern $q$ such that $p \prec q$.









dataset      #SDB      #I       avg(#s)   max_{s∈SDB}(#s)   type of data
Leviathan    5834      9025     33.81     100               book
Kosarak      69999     21144    7.97      796               web click stream
FIFA         20450     2990     34.74     100               web click stream
BIBLE        36369     13905    21.64     100               bible
Protein      103120    24       482       600               protein sequences
data-200K    200000    20       50        86                synthetic dataset
PubMed       17527     19931    29        198               bio-medical text

Table 2: Dataset Characteristics.


Proposition 5. In the worst case, filtering with the Prefix-Projection global
constraint can be achieved in $O(m \times \ell + m \times d + \ell \times d)$. The worst
case space complexity of Prefix-Projection is $O(m \times \ell)$.

Proof: Let $\ell$ be the length of the longest sequence in $SDB$, $m = \#SDB$, and
$d = \#\mathcal{I}$. Computing the pseudo-projected database $\mathcal{PSDB}_i$ can be
done in $O(m \times \ell)$: for each sequence $(sid, s)$ of $SDB$, checking if $\sigma$
occurs in $s$ is $O(\ell)$ and there are $m$ sequences. The total complexity of
function getFreqItems is $O(m \times (\ell + d))$. Lines 9-11 can be achieved in
$O(\ell \times d)$. So, the whole complexity is
$O(m \times \ell + m \times (\ell + d) + \ell \times d) = O(m \times \ell + m \times d + \ell \times d)$.
The space complexity of the filtering algorithm lies in the storage of the
$\mathcal{PSDB}$ internal data structure. In the worst case, we have to store $\ell$
pseudo-projected databases. Since each pseudo-projected database requires $O(m)$, the
worst case space complexity is $O(m \times \ell)$. □


4.5 Encoding of SPM Constraints 


Usual SPM constraints (see Section 2.2) can be reformulated in a straightforward way.
Let $P$ be the unknown pattern.

- Minimum size constraint:
$size(P, \ell_{min}) \equiv \bigwedge_{i=1}^{\ell_{min}} (P_i \neq \Box)$.

- Item constraint: let $V$ be a subset of items, and $l$ and $u$ two integers s.t.
$0 \le l \le u \le \ell$. $item(P, V) \equiv \bigwedge_{t \in V}$ Among($P$, $\{t\}$, $l$, $u$)
enforces that items of $V$ should occur at least $l$ times and at most $u$ times in
$P$. To forbid items of $V$ to occur in $P$, $l$ and $u$ must be set to 0.

- Regular expression constraint: let $A_{reg}$ be the deterministic finite automaton
encoding the regular expression $exp$. $reg(P, exp) \equiv$ Regular($P$, $A_{reg}$).
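The three reformulations can be illustrated by simple checkers working on a fully assigned pattern. This sketch only tests whether a given assignment satisfies each constraint; it does not reproduce the propagators of a CP solver. The function names, the dictionary-based automaton and the choice of stopping the automaton at the first $\Box$ are assumptions made for the example.

```python
from typing import Dict, Sequence, Set, Tuple

EMPTY = "□"  # end-of-pattern symbol of the concise encoding


def size_at_least(P: Sequence[str], l_min: int) -> bool:
    """Minimum size: the first l_min variables must hold a real item."""
    return l_min <= len(P) and all(P[i] != EMPTY for i in range(l_min))


def among(P: Sequence[str], values: Set[str], l: int, u: int) -> bool:
    """Among(P, values, l, u): between l and u variables take a value in values."""
    occurrences = sum(1 for x in P if x in values)
    return l <= occurrences <= u


def item_constraint(P: Sequence[str], V: Set[str], l: int, u: int) -> bool:
    """Conjunction of one Among constraint per item t of V."""
    return all(among(P, {t}, l, u) for t in V)


def regular(
    P: Sequence[str],
    delta: Dict[Tuple[int, str], int],
    start: int,
    accepting: Set[int],
) -> bool:
    """Run a DFA over the items of P, stopping at the first empty item."""
    state = start
    for x in P:
        if x == EMPTY:
            break
        if (state, x) not in delta:
            return False  # missing transition: the word is rejected
        state = delta[(state, x)]
    return state in accepting


if __name__ == "__main__":
    # Hypothetical automaton for the regular expression A B* C.
    delta = {(0, "A"): 1, (1, "B"): 1, (1, "C"): 2}
    P = ["A", "B", "C", EMPTY]
    print(size_at_least(P, 3), item_constraint(P, {"D"}, 0, 0),
          regular(P, delta, 0, {2}))
```

Because every constraint is stated on the same variables $\langle P_1, \ldots, P_\ell \rangle$, such constraints can be posted together with the frequency constraint without auxiliary variables.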


5 Experimental Evaluation 

This section reports experiments on several large real-life datasets from [6,3,18]
having varied characteristics and representing different application domains (see
Table 2). Our objective is (1) to compare our approach to existing CP methods as well
as to state-of-the-art methods for SPM in terms of scalability, which is a major
issue of existing CP methods, and (2) to show the flexibility of our approach, which
allows handling different constraints simultaneously.




















5.1 Experimental protocol 


The implementation of our approach was carried out in the Gecode solver⁴. All
experiments were conducted on a machine with an Intel X5670 processor and 24 GB of
memory. A time limit of 1 hour has been used. For each dataset, we varied the minsup
threshold until the methods were no longer able to complete the extraction of all
patterns within the time limit. $\ell$ was set to the length of the longest sequence
of $SDB$. The implementation and the datasets used in our experiments are available
online⁵. We compare our approach (indicated by PP) with:

1. two CP encodings [12], the most efficient CP methods for SPM: global-p.f and
decomposed-p.f;

2. state-of-the-art methods for SPM: PrefixSpan and cSpade;

3. SMA [18] for SPM under regular expressions.

We used the author's cSpade implementation⁶ for SPM, the publicly available
implementation of PrefixSpan by Y. Tabei⁷ and the SMA implementation⁸ for SPM under
regular expressions. The implementation⁹ of the two CP encodings was carried out in
the Gecode solver. All methods have been executed on the same machine.


5.2 Comparing with CP Methods for SPM 

First, we compare PP with the two CP encodings global-p.f and decomposed-p.f (see
Section 3.2). Fig. 1 shows the number of extracted sequential patterns and the CPU
times to extract them (in logscale for BIBLE, Kosarak and PubMed) for the three
methods.

First, as expected, the lower minsup is, the larger the number of extracted
sequential patterns. Second, when comparing the CPU times, decomposed-p.f is the
worst-performing method. On all the datasets, it fails to complete the extraction within
the time limit for all values of minsup we considered. Third, PP largely dominates
global-p.f on all the datasets: PP is more than an order of magnitude faster than
global-p.f. The gains in terms of CPU times are greatly amplified for low values
of minsup. On BIBLE (resp. PubMed), the speed-up is 84.4 (resp. 33.5) for minsup
equal to 1%. Another important observation is that, on most of the
datasets (except BIBLE and Kosarak), global-p.f is not able to mine patterns at
very low frequency within the time limit. For example, on FIFA, PP is able to complete
the extraction for values of minsup down to 6% in 1,457 seconds, while global-p.f
fails to complete the extraction for minsup less than 10%. The same trend is also
confirmed on Leviathan, where global-p.f is not able to mine patterns at 1%
minimum frequency.
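The speed-up values quoted above can be checked directly from the reported CPU times. The following is a minimal sketch, assuming that the speed-up is defined as the ratio of the global-p.f CPU time to the PP CPU time; the dictionary layout and the function name `speed_up` are illustrative choices, not part of the original implementation.

```python
# CPU times in seconds at minsup = 1%, as reported for BIBLE and PubMed.
cpu_times = {
    "BIBLE": {"PP": 41.4, "global-p.f": 3494.27},
    "PubMed": {"PP": 77.01, "global-p.f": 2581.51},
}


def speed_up(times):
    """Ratio of the global-p.f CPU time to the PP CPU time (assumed definition)."""
    return times["global-p.f"] / times["PP"]


# Round to one decimal place, as in the text.
speed_ups = {name: round(speed_up(t), 1) for name, t in cpu_times.items()}
print(speed_ups)
```

Under this definition, the rounded ratios agree with the values 84.4 and 33.5 given in the text.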

Gecode: http://www.gecode.org

Implementation and datasets: https://sites.google.com/site/prefixprojectionlcp/

cSpade: http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software/

PrefixSpan: https://code.google.com/p/prefixspan/

SMA: http://www-kdd.isti.cnr.it/SMA/

CP encodings: https://dtai.cs.kuleuven.be/CP4IM/cpsm/

Fig. 1: Comparing PP with global-p.f for SPM on real-life datasets: CPU times (top) and
number of patterns (bottom).


To complement the results given by Fig. 1, Table 3 reports, for different datasets and
different values of minsup, the number of calls to the propagate routine of Gecode
(column 5) and the number of nodes of the search tree (column 6). First, PP explores
fewer nodes than global-p.f, but the difference is not huge (gains of 45% and 33%
on FIFA and BIBLE respectively). Second, our approach is very effective in terms of
number of propagations. For PP, the number of propagations remains small (in the
thousands for small values of minsup) compared to global-p.f (in the millions). This is
due to the huge number of reified constraints used in global-p.f to encode the
subsequence relation. On the contrary, our Prefix-Projection global constraint does
not require any reified constraints or extra variables to encode the subsequence
relation.
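The node gains and the gap in propagations can be recomputed from the FIFA and BIBLE rows of Table 3. The sketch below is illustrative only: the text does not state how the 45% and 33% figures were aggregated, so averaging the per-row gains is an assumption, as are the helper names `node_gain_percent` and `propagation_ratio`.

```python
# (PP, global-p.f) pairs taken from Table 3, one pair per minsup value.
nodes = {
    "FIFA": [(1025, 1873), (1922, 3486), (3923, 7151),
             (8042, 14616), (18108, 32604), (45452, 81181)],
    "BIBLE": [(235, 348), (362, 548), (669, 1016),
              (1575, 2371), (7048, 10605), (31283, 46557)],
}
propagations = {
    "FIFA": [(1884, 11649290), (3502, 19736442), (7181, 35942314),
             (14691, 65522076), (32820, 126187396), (81767, 266635050)],
    "BIBLE": [(363, 4189140), (575, 5637671), (1065, 8592858),
              (2482, 15379396), (11104, 39797508), (49057, 98676120)],
}


def node_gain_percent(pp, gpf):
    """Percentage of search-tree nodes saved by PP relative to global-p.f."""
    return 100.0 * (gpf - pp) / gpf


def propagation_ratio(pp, gpf):
    """How many times more propagations global-p.f performs than PP."""
    return gpf / pp


# Assumption: summarize each dataset by the mean of its per-row gains.
mean_gain = {
    name: sum(node_gain_percent(pp, gpf) for pp, gpf in rows) / len(rows)
    for name, rows in nodes.items()
}
min_ratio = {
    name: min(propagation_ratio(pp, gpf) for pp, gpf in rows)
    for name, rows in propagations.items()
}
print({k: round(v) for k, v in mean_gain.items()})
print({k: round(v) for k, v in min_ratio.items()})
```

With this summary, the node gains stay modest while the propagation counts differ by more than three orders of magnitude on every row, which is consistent with the discussion above.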


Dataset    minsup (%)  #PATTERNS  CPU times (s)         #PROPAGATIONS        #NODES
                                  PP        global-p.f  PP       global-p.f  PP       global-p.f
FIFA       20          938        8.16      129.54      1884     11649290    1025     1873
           18          1743       13.39     222.68      3502     19736442    1922     3486
           16          3578       24.39     396.11      7181     35942314    3923     7151
           14          7313       44.08     704         14691    65522076    8042     14616
           12          16323      86.46     1271.84     32820    126187396   18108    32604
           10          40642      185.88    2761.47     81767    266635050   45452    81181
BIBLE      10          174        1.98      105.01      363      4189140     235      348
           8           274        2.47      153.61      575      5637671     362      548
           6           508        3.45      270.49      1065     8592858     669      1016
           4           1185       5.7       552.62      2482     15379396    1575     2371
           2           5311       15.05     1470.45     11104    39797508    7048     10605
           1           23340      41.4      3494.27     49057    98676120    31283    46557
PubMed     5           2312       8.26      253.16      4736     15521327    2833     4619
           4           3625       11.17     340.24      7413     20643992    4428     7242
           3           6336       16.51     536.96      12988    29940327    7757     12643
           2           13998      28.91     955.54      28680    50353208    17145    27910
           1           53818      77.01     2581.51     110133   124197857   65587    107051
Protein    99.99       127        165.31    219.69      264      26731250    172      221
           99.988      216        262.12    411.83      451      44575117    293      390
           99.986      384        467.96    909.47      805      80859312    514      679
           99.984      631        753.3     1443.92     1322     132238827   845      1119
           99.982      964        1078.73   2615        2014     201616651   1284     1749
           99.98       2143       2315.65   -           4485     -           2890     -
Kosarak    1           384        2.59      137.95      793      8741452     482      769
           0.5         1638       7.42      491.11      3350     26604840    2087     3271
           0.3         4943       19.25     1111.16     10103    56854431    6407     9836
           0.28        6015       22.83     1266.39     12308    64003092    7831     11954
           0.24        9534       36.54     1635.38     19552    81485031    12667    18966
           0.2         15010      57.6      2428.23     30893    111655799   20055    29713
Leviathan  10          651        1.78      12.56       1366     2142870     849      1301
           8           1133       2.57      19.44       2379     3169615     1487     2261
           6           2300       4.27      32.85       4824     5212113     3008     4575
           4           6286       9.08      66.31       13197    10569654    8227     12500
           2           33387      32.27     190.45      70016    33832141    43588    66116
           1           167189     121.89    -           350310   -           217904   -

Table 3: PP vs. global-p.f.


5.3 Comparing with ad hoc Methods for SPM 

Our second experiment compares PP with state-of-the-art methods for SPM. Fig. 2
shows the CPU times of the three methods. First, cSpade obtains the best performance
on all datasets (except on Protein). PP exhibits a behavior similar to that of
cSpade, but it is slower (not counting the highest values of minsup). The behavior
of cSpade on Protein is due to the vertical representation format, which is not
appropriate for databases having long sequences and a small number of distinct items,
thus degrading the performance of the mining process. Second, PP, which also uses the
concept of projected databases, clearly outperforms PrefixSpan on all datasets. This
is due to our filtering algorithm combined with incremental data structures to
manage the projected databases. On FIFA, PrefixSpan is not able to complete the
extraction for minsup less than 12%, while our approach remains feasible down to 6%
within the time limit. On Protein, PrefixSpan fails to complete the extraction for all
values of minsup we considered. These results clearly demonstrate that our approach
competes well with state-of-the-art methods for SPM on large datasets and achieves
scalability, which is a major issue for existing CP approaches.

Fig. 2: Comparing Prefix-Projection with state-of-the-art algorithms for SPM.
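To make the projected-database principle shared by PP and PrefixSpan concrete, the following is a minimal sketch of a PrefixSpan-style enumeration in Python. It is neither the filtering algorithm of Prefix-Projection nor its incremental data structures; the function name `frequent_patterns`, the (sequence id, offset) representation and the toy database are assumptions made for illustration, and minsup is given here as an absolute count.

```python
from collections import defaultdict


def frequent_patterns(sdb, minsup):
    """Enumerate all sequential patterns supported by at least minsup sequences."""
    patterns = []

    def grow(prefix, projected):
        # projected holds one (sequence id, offset) pair per supporting sequence;
        # sdb[sid][offset:] is the suffix left after the first embedding of prefix.
        support = defaultdict(set)
        for sid, offset in projected:
            for item in sdb[sid][offset:]:
                support[item].add(sid)
        for item in sorted(support):
            if len(support[item]) < minsup:
                # Anti-monotonicity: no extension of prefix + item can be frequent.
                continue
            new_prefix = prefix + [item]
            patterns.append((tuple(new_prefix), len(support[item])))
            new_projected = []
            for sid, offset in projected:
                suffix = sdb[sid][offset:]
                if item in suffix:
                    # Keep only what follows the first occurrence of item.
                    new_projected.append((sid, offset + suffix.index(item) + 1))
            grow(new_prefix, new_projected)

    grow([], [(sid, 0) for sid in range(len(sdb))])
    return patterns


# Toy database (hypothetical), one list of items per sequence.
toy_sdb = [list("ABCBC"), list("BABC"), list("AB"), list("BCD")]
result = frequent_patterns(toy_sdb, minsup=3)
print(result)
```

Each recursive call scans only the projected suffixes rather than the whole database, which is the property that both PrefixSpan and the filtering of PP rely on.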

5.4 SPM under size and item constraints 

Our third experiment aims at assessing the interest of pushing different types of
constraints simultaneously. On the PubMed dataset, we impose usual constraints, such as
the minimum frequency and the minimum size constraints, together with other useful
constraints expressing linguistic knowledge, such as the item constraint. The goal is to
retain sequential patterns which convey linguistic regularities (e.g., gene - rare disease
relationships). The size constraint removes patterns that are too small, w.r.t. the
number of items (number of words), to be relevant. We tested this constraint
with ℓmin set to 3. The item constraint imposes that the extracted patterns must contain
the item GENE and the item DISEASE. As no ad hoc method exists for this combination
of constraints, we only compare PP with global-p.f. Fig. 3 shows the CPU
times and the number of sequential patterns extracted with and without constraints.
First, pushing the two constraints simultaneously significantly reduces the
number of patterns. Moreover, the CPU times for PP decrease slightly, whereas for
global-p.f (with and without constraints) they are almost the same. This is probably
due to the weak communication between the m exists-embedding reified
global constraints and the two constraints, which significantly reduces the quality of the
whole filtering. Second (see Table 4), when considering the two constraints, PP clearly
dominates global-p.f (speed-up value up to 51.5). Moreover, the number of
propagations performed by PP remains very small compared to global-p.f.

Fig. 3: Comparing PP with global-p.f under minimum size and item constraints on PubMed.


Dataset    minsup (%)  #PATTERNS  CPU times (s)         #PROPAGATIONS        #NODES
                                  PP        global-p.f  PP       global-p.f  PP       global-p.f
PubMed     5           279        6.76      252.36      7878     12234292    2285     4619
           4           445        8.81      339.09      12091    16475953    3618     7242
           3           799        12.35     535.32      20268    24380096    6271     12643
           2           1837       20.41     953.32      43088    42055022    13888    27910
           1           7187       49.98     2574.42     157899   107978568   52508    107051

Table 4: PP vs. global-p.f under minimum size and item constraints.


Fig. 3 also compares the two methods under the minimum size constraint for different
values of ℓmin, with minsup fixed to 1%. Table 5 compares the two methods in terms of
number of propagations (column 5) and number of nodes of the search tree (column 6).
Once again, PP is always the best-performing method (speed-up value up to 53.1).
These results also confirm what we observed previously, namely the weak communication
between reified global constraints and constraints imposed on patterns (i.e., size
and item constraints).
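The semantics of the two constraints used in this experiment can be stated in a few lines of Python. This is only a sketch of what the constraints accept, written as a post-filter over already enumerated patterns; it does not reproduce how PP or global-p.f push them during search. The function name `satisfies_constraints` and all item names other than GENE and DISEASE are hypothetical.

```python
def satisfies_constraints(pattern, l_min=3, required=("GENE", "DISEASE")):
    """True if the pattern has at least l_min items and contains every required item."""
    return len(pattern) >= l_min and all(item in pattern for item in required)


# Hypothetical candidate patterns; only GENE and DISEASE come from the text.
candidates = [
    ("GENE", "mutation", "DISEASE"),      # kept: size 3, both items present
    ("GENE", "DISEASE"),                  # rejected: size 2 < l_min
    ("patient", "with", "DISEASE"),       # rejected: GENE is missing
    ("DISEASE", "caused", "by", "GENE"),  # kept: order of items is not constrained
]
kept = [p for p in candidates if satisfies_constraints(p)]
print(kept)
```

A post-filter of this kind still pays for the enumeration of every frequent pattern, which is why exploiting the constraints during propagation, as discussed above, matters for CPU time.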


5.5 SPM under regular constraints 

Our last experiment compares PP-REG against two variants of SMA: SMA-1P (SMA
one pass) and SMA-FC (SMA Full Check). Two datasets are considered from [15]: one
synthetic dataset (data-200k), and one real-life dataset (Protein). For data-200k, we used
two REs:

- RE10 = A*B(B|C)D*EF*(G|H)I,

- RE14 = A*(Q|BS*(B|C))D*E(I|S)*(F|H)G*R.

For Protein, we used RE2 = (S|T) . (R|K) representing Protein kinase C
phosphorylation (where . represents any symbol). Fig. 4 reports the CPU-time comparison. On
the synthetic dataset, our approach is very effective. For RE14, our method is more than
an order of magnitude faster than SMA. On Protein, the gap between the three methods
shrinks, but our method remains effective. For the particular case of RE2, the Regular
constraint can be substituted by restricting the domain of the first and third variables to
{S, T} and {R, K} respectively (denoted as PP-SRE), thus improving performance.
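The substitution used for RE2 can be illustrated with Python's standard `re` module. The sketch below assumes that items are single characters, that a pattern must match the expression as a whole, and that only patterns of length 3 are considered; the alphabet is a hypothetical subset of symbols chosen for the check, and the function names are illustrative.

```python
import re
from itertools import product

# RE2 = (S|T) . (R|K), where "." stands for any symbol.
RE2 = re.compile(r"(S|T).(R|K)")


def matches_re2(pattern):
    """Check the regular-expression constraint on a whole pattern (a string)."""
    return RE2.fullmatch(pattern) is not None


def in_restricted_domains(pattern):
    """Same condition expressed as domain restrictions on positions 1 and 3."""
    return len(pattern) == 3 and pattern[0] in {"S", "T"} and pattern[2] in {"R", "K"}


# Hypothetical alphabet, used only to compare the two formulations exhaustively.
ALPHABET = "STRKAG"
all_triples = ["".join(t) for t in product(ALPHABET, repeat=3)]
equivalent = all(matches_re2(p) == in_restricted_domains(p) for p in all_triples)
accepted = [p for p in all_triples if matches_re2(p)]
print(equivalent, len(accepted))
```

On length-3 patterns the two formulations accept exactly the same set, so the domain restriction can stand in for the automaton-based check in this particular case.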


Dataset    ℓmin        #PATTERNS  CPU times (s)         #PROPAGATIONS        #NODES
                                  PP        global-p.f  PP       global-p.f  PP       global-p.f
PubMed     8           12         48.52     2577.09     55523    105343528   50264    107051
           6           3596       50.91     2576.9      59144    106272419   50486    107051
           4           40669      70.61     2579.3      96871    117781215   59194    107051
           2           53486      76.64     2580.41     109801   123913176   65334    107051
           1           53818      78.49     2579.85     110133   117208559   65587    107051

Table 5: PP vs. global-p.f under minimum size constraint.

Fig. 4: Comparing Prefix-Projection with SMA for SPM under RE constraint.


6 Conclusion 

We have proposed the global constraint Prefix-Projection for sequential pattern
mining. Prefix-Projection uses a concise encoding and provides an efficient filtering
based on specific properties of the projected databases and on the anti-monotonicity of
the frequency constraint. When this global constraint is integrated into a CP solver, it
makes it possible to handle several constraints simultaneously. Some of them, such as
size, item membership and regular expression constraints, are considered in this paper.
Another strength is that, contrary to existing CP approaches for SPM, our global
constraint does not require any reified constraints or extra variables to encode the
subsequence relation. Finally, although Prefix-Projection is well suited for constraints
on sequences, it would need to be adapted to handle constraints on subsequence relations,
such as gap constraints.

Experiments performed on several real-life datasets show that our approach clearly
outperforms existing CP approaches, competes well with ad hoc methods on large
datasets, and achieves scalability, which is a major issue for CP approaches. As future
work, we intend to handle constraints on sets of sequential patterns, such as closedness,
relevant subgroup and skypattern constraints.